{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Components are the lowest level of building blocks in EvalML. Each component represents a fundamental operation to be applied to data.\n",
    "\n",
    "All components accept parameters as keyword arguments to their `__init__` methods. These parameters can be used to configure behavior.\n",
    "\n",
    "Each component class definition must include a human-readable `name` for the component. Additionally, each component class may expose parameters for AutoML search by defining a `hyperparameter_ranges` attribute containing the parameters in question.\n",
    "\n",
    "EvalML splits components into two categories: **transformers** and **estimators**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transformers\n",
    "\n",
    "Transformers subclass the `Transformer` class, and define a `fit` method to learn information from training data and a `transform` method to apply a learned transformation to new data.\n",
    "\n",
    "For example, an [imputer](../autoapi/evalml/pipelines/components/index.rst#evalml.pipelines.components.SimpleImputer) is configured with the desired impute strategy to follow, for instance the mean value. The imputers `fit` method would learn the mean from the training data, and the `transform` method would fill the learned mean value in for any missing values in new data.\n",
    "\n",
    "All transformers can execute `fit` and `transform` separately or in one step by calling `fit_transform`. Defining a custom `fit_transform` method can facilitate useful performance optimizations in some cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from evalml.pipelines.components import SimpleImputer\n",
    "\n",
    "X = pd.DataFrame([[1, 2, 3], [1, np.nan, 3]])\n",
    "display(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import woodwork as ww\n",
    "\n",
    "imp = SimpleImputer(impute_strategy=\"mean\")\n",
    "\n",
    "X.ww.init()\n",
    "X = imp.fit_transform(X)\n",
    "display(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a list of all transformers included with EvalML:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.pipelines.components.utils import all_components, Estimator, Transformer\n",
    "\n",
    "for component in all_components():\n",
    "    if issubclass(component, Transformer):\n",
    "        print(f\"Transformer: {component.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimators\n",
    "\n",
    "Each estimator wraps an ML algorithm. Estimators subclass the `Estimator` class, and define a `fit` method to learn information from training data and a `predict` method for generating predictions from new data. Classification estimators should also define a `predict_proba` method for generating predicted probabilities.\n",
    "\n",
    "Estimator classes each define a `model_family` attribute indicating what type of model is used.\n",
    "\n",
    "Here's an example of using the [LogisticRegressionClassifier](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.LogisticRegressionClassifier) estimator to fit and predict on a simple dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.pipelines.components import LogisticRegressionClassifier\n",
    "\n",
    "clf = LogisticRegressionClassifier()\n",
    "\n",
    "X = X\n",
    "y = [1, 0]\n",
    "\n",
    "clf.fit(X, y)\n",
    "clf.predict(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a list of all estimators included with EvalML:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.pipelines.components.utils import all_components, Estimator, Transformer\n",
    "\n",
    "for component in all_components():\n",
    "    if issubclass(component, Estimator):\n",
    "        print(f\"Estimator: {component.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining Custom Components\n",
    "\n",
    "EvalML allows you to easily create your own custom components by following the steps below.\n",
    "\n",
    "### Custom Transformers\n",
    "\n",
    "Your transformer must inherit from the correct subclass. In this case [Transformer](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.Transformer) for components that transform data. Next we will use EvalML's [DropNullColumns](../autoapi/evalml/pipelines/components/index.rst#evalml.pipelines.components.DropNullColumns) as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.pipelines.components import Transformer\n",
    "from evalml.utils import (\n",
    "    infer_feature_types,\n",
    ")\n",
    "\n",
    "\n",
    "class DropNullColumns(Transformer):\n",
    "    \"\"\"Transformer to drop features whose percentage of NaN values exceeds a specified threshold\"\"\"\n",
    "\n",
    "    name = \"Drop Null Columns Transformer\"\n",
    "    hyperparameter_ranges = {}\n",
    "\n",
    "    def __init__(self, pct_null_threshold=1.0, random_seed=0, **kwargs):\n",
    "        \"\"\"Initalizes an transformer to drop features whose percentage of NaN values exceeds a specified threshold.\n",
    "\n",
    "        Args:\n",
    "            pct_null_threshold(float): The percentage of NaN values in an input feature to drop.\n",
    "                Must be a value between [0, 1] inclusive. If equal to 0.0, will drop columns with any null values.\n",
    "                If equal to 1.0, will drop columns with all null values. Defaults to 0.95.\n",
    "        \"\"\"\n",
    "        if pct_null_threshold < 0 or pct_null_threshold > 1:\n",
    "            raise ValueError(\n",
    "                \"pct_null_threshold must be a float between 0 and 1, inclusive.\"\n",
    "            )\n",
    "        parameters = {\"pct_null_threshold\": pct_null_threshold}\n",
    "        parameters.update(kwargs)\n",
    "\n",
    "        self._cols_to_drop = None\n",
    "        super().__init__(\n",
    "            parameters=parameters, component_obj=None, random_seed=random_seed\n",
    "        )\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        \"\"\"Fits DropNullColumns component to data\n",
    "\n",
    "        Args:\n",
    "            X (pd.DataFrame): The input training data of shape [n_samples, n_features]\n",
    "            y (pd.Series, optional): The target training data of length [n_samples]\n",
    "\n",
    "        Returns:\n",
    "            self\n",
    "        \"\"\"\n",
    "        pct_null_threshold = self.parameters[\"pct_null_threshold\"]\n",
    "        X_t = infer_feature_types(X)\n",
    "        percent_null = X_t.isnull().mean()\n",
    "        if pct_null_threshold == 0.0:\n",
    "            null_cols = percent_null[percent_null > 0]\n",
    "        else:\n",
    "            null_cols = percent_null[percent_null >= pct_null_threshold]\n",
    "        self._cols_to_drop = list(null_cols.index)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, y=None):\n",
    "        \"\"\"Transforms data X by dropping columns that exceed the threshold of null values.\n",
    "\n",
    "        Args:\n",
    "            X (pd.DataFrame): Data to transform\n",
    "            y (pd.Series, optional): Ignored.\n",
    "\n",
    "        Returns:\n",
    "            pd.DataFrame: Transformed X\n",
    "        \"\"\"\n",
    "        X_t = infer_feature_types(X)\n",
    "        return X_t.drop(self._cols_to_drop)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Required fields\n",
    "\n",
    "- `name`: A human-readable name.\n",
    "\n",
    "- `modifies_features`: A boolean that specifies whether this component modifies (subsets or transforms) the features variable during `transform`.\n",
    "\n",
    "- `modifies_target`: A boolean that specifies whether this component modifies (subsets or transforms) the target variable during `transform`.\n",
    " \n",
    "#### Required methods\n",
    "\n",
    "Likewise, there are select methods you need to override as `Transformer` is an abstract base class:\n",
    "\n",
    "-  `__init__()`: The `__init__()` method of your transformer will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_seed` value. You can see that `component_obj` is set to `None` above and we will discuss `component_obj` in depth later on.\n",
    "\n",
    "- `fit()`: The `fit()` method is responsible for fitting your component on training data. It should return the component object.\n",
    "\n",
    "- `transform()`: After fitting a component, the `transform()` method will take in new data and transform accordingly. It should return a pandas dataframe with woodwork initialized. Note: a component must call `fit()` before `transform()`.\n",
    "\n",
    "You can also call or override `fit_transform()` that combines `fit()` and `transform()` into one method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom Estimators\n",
    "\n",
    "Your estimator must inherit from the correct subclass. In this case [Estimator](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.Estimator) for components that predict new target values. Next we will use EvalML's [BaselineRegressor](../autoapi/evalml/pipelines/components/index.rst#evalml.pipelines.components.BaselineRegressor) as an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from evalml.model_family import ModelFamily\n",
    "from evalml.pipelines.components.estimators import Estimator\n",
    "from evalml.problem_types import ProblemTypes\n",
    "\n",
    "\n",
    "class BaselineRegressor(Estimator):\n",
    "    \"\"\"Regressor that predicts using the specified strategy.\n",
    "\n",
    "    This is useful as a simple baseline regressor to compare with other regressors.\n",
    "    \"\"\"\n",
    "\n",
    "    name = \"Baseline Regressor\"\n",
    "    hyperparameter_ranges = {}\n",
    "    model_family = ModelFamily.BASELINE\n",
    "    supported_problem_types = [\n",
    "        ProblemTypes.REGRESSION,\n",
    "        ProblemTypes.TIME_SERIES_REGRESSION,\n",
    "    ]\n",
    "\n",
    "    def __init__(self, strategy=\"mean\", random_seed=0, **kwargs):\n",
    "        \"\"\"Baseline regressor that uses a simple strategy to make predictions.\n",
    "\n",
    "        Args:\n",
    "            strategy (str): Method used to predict. Valid options are \"mean\", \"median\". Defaults to \"mean\".\n",
    "            random_seed (int): Seed for the random number generator. Defaults to 0.\n",
    "\n",
    "        \"\"\"\n",
    "        if strategy not in [\"mean\", \"median\"]:\n",
    "            raise ValueError(\n",
    "                \"'strategy' parameter must equal either 'mean' or 'median'\"\n",
    "            )\n",
    "        parameters = {\"strategy\": strategy}\n",
    "        parameters.update(kwargs)\n",
    "\n",
    "        self._prediction_value = None\n",
    "        self._num_features = None\n",
    "        super().__init__(\n",
    "            parameters=parameters, component_obj=None, random_seed=random_seed\n",
    "        )\n",
    "\n",
    "    def fit(self, X, y=None):\n",
    "        if y is None:\n",
    "            raise ValueError(\"Cannot fit Baseline regressor if y is None\")\n",
    "        X = infer_feature_types(X)\n",
    "        y = infer_feature_types(y)\n",
    "\n",
    "        if self.parameters[\"strategy\"] == \"mean\":\n",
    "            self._prediction_value = y.mean()\n",
    "        elif self.parameters[\"strategy\"] == \"median\":\n",
    "            self._prediction_value = y.median()\n",
    "        self._num_features = X.shape[1]\n",
    "        return self\n",
    "\n",
    "    def predict(self, X):\n",
    "        X = infer_feature_types(X)\n",
    "        predictions = pd.Series([self._prediction_value] * len(X))\n",
    "        return infer_feature_types(predictions)\n",
    "\n",
    "    @property\n",
    "    def feature_importance(self):\n",
    "        \"\"\"Returns importance associated with each feature. Since baseline regressors do not use input features to calculate predictions, returns an array of zeroes.\n",
    "\n",
    "        Returns:\n",
    "            np.ndarray (float): An array of zeroes\n",
    "\n",
    "        \"\"\"\n",
    "        return np.zeros(self._num_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Required fields\n",
    "\n",
    "- `name`: A human-readable name.\n",
    "\n",
    "- `model_family` - EvalML [model_family](../autoapi/evalml/model_family/index.rst#evalml.model_family.ModelFamily) that this component belongs to\n",
    "\n",
    "- `supported_problem_types` - list of EvalML [problem_types](../autoapi/evalml/problem_types/index.rst#evalml.problem_types.ProblemTypes) that this component supports\n",
    "- `modifies_features`: A boolean that specifies whether the return value from `predict` or `predict_proba` should be used as features.\n",
    "\n",
    "- `modifies_target`: A boolean that specifies whether the return value from `predict` or `predict_proba` should be used as the target variable.\n",
    "\n",
    "Model families and problem types include:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.model_family import ModelFamily\n",
    "from evalml.problem_types import ProblemTypes\n",
    "\n",
    "print(\"Model Families:\\n\", [m.value for m in ModelFamily])\n",
    "print(\"Problem Types:\\n\", [p.value for p in ProblemTypes])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Required methods\n",
    "\n",
    "-  `__init__()` - the `__init__()` method of your estimator will need to call `super().__init__()` and pass three parameters in: a `parameters` dictionary holding the parameters to the component, the `component_obj`, and the `random_seed` value.\n",
    "\n",
    "- `fit()` - the `fit()` method is responsible for fitting your component on training data.\n",
    "\n",
    "- `predict()` - after fitting a component, the `predict()` method will take in new data and predict new target values. Note: a component must call `fit()` before `predict()`.\n",
    "\n",
    "- `feature_importance` - `feature_importance` is a [Python property](https://docs.python.org/3/library/functions.html#property) that returns a list of importances associated with each feature.\n",
    "\n",
    "If your estimator handles classification problems it also requires an additonal method:\n",
    "\n",
    "- `predict_proba()` - this method predicts probability estimates for classification labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Components Wrapping Third-Party Objects\n",
    "\n",
    "The `component_obj` parameter is used for wrapping third-party objects and using them in component implementation. If you're using a `component_obj` you will need to define `__init__()` and pass in the relevant object that has also implemented the required methods mentioned above. However, if the `component_obj` does not follow EvalML component conventions, you may need to override methods as needed. Below is an example of EvalML's [LinearRegressor](../autoapi/evalml/pipelines/index.rst#evalml.pipelines.LinearRegressor)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression as SKLinearRegression\n",
    "\n",
    "from evalml.model_family import ModelFamily\n",
    "from evalml.pipelines.components.estimators import Estimator\n",
    "from evalml.problem_types import ProblemTypes\n",
    "\n",
    "\n",
    "class LinearRegressor(Estimator):\n",
    "    \"\"\"Linear Regressor.\"\"\"\n",
    "\n",
    "    name = \"Linear Regressor\"\n",
    "    model_family = ModelFamily.LINEAR_MODEL\n",
    "    supported_problem_types = [ProblemTypes.REGRESSION]\n",
    "\n",
    "    def __init__(\n",
    "        self, fit_intercept=True, normalize=False, n_jobs=-1, random_seed=0, **kwargs\n",
    "    ):\n",
    "        parameters = {\n",
    "            \"fit_intercept\": fit_intercept,\n",
    "            \"normalize\": normalize,\n",
    "            \"n_jobs\": n_jobs,\n",
    "        }\n",
    "        parameters.update(kwargs)\n",
    "        linear_regressor = SKLinearRegression(**parameters)\n",
    "        super().__init__(\n",
    "            parameters=parameters,\n",
    "            component_obj=linear_regressor,\n",
    "            random_seed=random_seed,\n",
    "        )\n",
    "\n",
    "    @property\n",
    "    def feature_importance(self):\n",
    "        return self._component_obj.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hyperparameter Ranges for AutoML\n",
    "`hyperparameter_ranges` is a dictionary mapping the parameter name (str) to an allowed range ([SkOpt Space](https://scikit-optimize.github.io/stable/modules/classes.html#module-skopt.space.space)) for that parameter. Both lists and `skopt.space.Categorical` values are accepted for categorical spaces. \n",
    "\n",
    "AutoML will perform a search over the allowed ranges for each parameter to select models which produce optimal performance within those ranges. AutoML gets the allowed ranges for each component from the component's `hyperparameter_ranges` class attribute. Any component parameter you add an entry for in `hyperparameter_ranges` will be included in the AutoML search. If parameters are omitted, AutoML will use the default value in all pipelines. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Component Code\n",
    "\n",
    "Once you have a component defined in EvalML, you can generate string Python code to recreate this component, which can then be saved and run elsewhere with EvalML. `generate_component_code` requires a component instance as the input. This method works for custom components as well, although it won't return the code required to define the custom component. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from evalml.pipelines.components import LogisticRegressionClassifier\n",
    "from evalml.pipelines.components.utils import generate_component_code\n",
    "\n",
    "lr = LogisticRegressionClassifier(C=5)\n",
    "code = generate_component_code(lr)\n",
    "print(code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this string can then be copy and pasted into a separate window and executed as python code\n",
    "exec(code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can also do this for custom components\n",
    "from evalml.pipelines.components.utils import generate_component_code\n",
    "\n",
    "myDropNull = DropNullColumns()\n",
    "print(generate_component_code(myDropNull))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Expectations for Custom Classification Components\n",
    "EvalML expects the following from custom classification component implementations:\n",
    "\n",
    "- Classification targets will range from 0 to n-1 and are integers.\n",
    "- For classification estimators, the order of predict_proba's columns must match the order of the target, and the column names must be integers ranging from 0 to n-1"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
